
which one is lowly transcribed or even not transcribed at all. RNA-Seq count data are
used as an alternative to microarray data in eQTL analysis. QTL analysis is a statistical
method that links phenotypic data (trait measurements) with genotypic data (markers, usually
SNPs) in an attempt to explain the genetic basis of variation in complex traits. eQTL
analysis, in turn, treats gene expression as the trait: it links markers (genotype) with gene
expression levels measured in a large number of individuals, and the data are modeled using
generalized linear models.
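The eQTL idea can be illustrated with a minimal sketch. It assumes the simplest member of the generalized linear model family, an ordinary least-squares fit, and regresses the expression level of one gene on the genotype of one SNP coded as the number of minor alleles (0, 1, or 2). The function name, the genotype coding, and the toy values are illustrative assumptions and are not taken from any particular eQTL package; real count data would normally call for a count-appropriate model.

```python
from statistics import mean


def fit_eqtl_slope(genotypes, expression):
    """Least-squares fit of expression = intercept + slope * genotype.

    genotypes:  minor-allele counts per individual (0, 1, or 2); an assumed coding.
    expression: expression level of one gene in the same individuals.
    Returns (intercept, slope); the slope is the per-allele effect on expression.
    """
    if len(genotypes) != len(expression):
        raise ValueError("genotypes and expression must have the same length")
    g_mean = mean(genotypes)
    e_mean = mean(expression)
    # Sum of squared deviations of the genotype from its mean.
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("genotype does not vary; the slope is undefined")
    # Sum of cross-products between genotype and expression deviations.
    sxy = sum((g - g_mean) * (e - e_mean) for g, e in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = e_mean - slope * g_mean
    return intercept, slope


# Toy data (hypothetical): six individuals, one SNP, one gene.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [10.0, 12.0, 15.0, 17.0, 20.0, 22.0]
intercept, slope = fit_eqtl_slope(genotypes, expression)
print(f"intercept={intercept:.1f}, per-allele effect={slope:.1f}")
```

A slope far from zero suggests that the marker is associated with the expression of the gene; a full analysis would also test the slope for significance and repeat the fit over many marker and gene pairs.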

RNA-Seq is a powerful tool for detecting alternative splice patterns, which are important
for understanding the development of human diseases. Paired-end sequencing provides sequence
information from both ends of each fragment and helps detect splicing patterns without
requiring prior knowledge of transcript annotations. Single-molecule, real-time (SMRT)
sequencing is the core technology powering long-read sequencing, which allows splicing
patterns and transcript connectivity to be examined on a genome-wide scale by generating
full-length transcript sequences.

RNA-Seq is also used for fusion gene detection. A fusion gene is a gene made by joining
parts of two different genes. It is usually created when a gene from one chromosome moves to
another chromosome. The fusion gene is transcribed into mRNA that is translated into a
fusion protein. Fusion proteins are often implicated in some types of cancer, including
leukemia; soft tissue sarcoma; cancers of the prostate, breast, lung, bladder, colon,
and rectum; and CNS tumors. Paired-end RNA-Seq data are usually used for fusion gene
detection [4].
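One reason paired-end data suit fusion detection is that the two mates of a read pair can align to two different genes. The following is a minimal sketch of that idea, not the method of any specific fusion caller. It assumes an earlier alignment step has already assigned each mate to a gene; the input format, the gene names, and the `min_pairs` threshold are hypothetical.

```python
from collections import Counter


def count_fusion_candidates(pair_assignments, min_pairs=2):
    """Count read pairs whose two mates were assigned to different genes.

    pair_assignments: iterable of (gene_of_mate1, gene_of_mate2) tuples, with
                      None for an unaligned mate (an assumed input format).
    min_pairs:        hypothetical minimum number of supporting pairs required
                      to report a gene pair.
    Returns a dict mapping a sorted (gene, gene) tuple to its supporting pair count.
    """
    support = Counter()
    for gene1, gene2 in pair_assignments:
        # Skip pairs with an unaligned mate and pairs that fall within one gene.
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue
        # Sort so that (A, B) and (B, A) count as the same candidate.
        support[tuple(sorted((gene1, gene2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_pairs}


# Toy input (hypothetical gene names).
pairs = [
    ("geneA", "geneA"),  # concordant pair, ignored
    ("geneA", "geneB"),  # supports geneA-geneB
    ("geneB", "geneA"),  # supports geneA-geneB
    ("geneC", None),     # one mate unaligned, ignored
    ("geneA", "geneC"),  # only one supporting pair, below the threshold
]
candidates = count_fusion_candidates(pairs)
print(candidates)
```

Requiring more than one supporting pair is a simple guard against spurious alignments; dedicated tools apply further filters before reporting a fusion.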

Other kinds of RNA-Seq applications include integration of RNA-Seq data analysis with

other technologies.

The library preparation of mRNA is similar to that of DNA. However, mRNA must first
be separated from other types of RNA by an enrichment technique, which either selects
polyadenylated transcripts or depletes the other types of RNA (mainly ribosomal RNA). The
RNA must then be converted into complementary DNA (cDNA) by reverse transcription before
library preparation. As with a DNA library, cDNA library preparation involves fragmentation
and adaptor ligation to each end of the fragments. The cDNA fragments are then sequenced on
the sequencing machine, and the sequencing can be either single end (each fragment read from
one end only) or paired end (each fragment read from both ends). The sequencing generates
sequence data in the form of reads in FASTQ files. Those reads are the sequenced fragments
of the expressed genes in the sample.
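To make the FASTQ output concrete, here is a minimal sketch of a reader. It assumes the common layout of four lines per record (an identifier line starting with `@`, the sequence, a separator line starting with `+`, and a quality string) and Phred+33 quality encoding; the function name and the two example reads are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from a FASTQ text stream.

    Assumes four lines per record and Phred+33 quality encoding.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality lengths differ in {header!r}")
        # Phred+33: quality score = ASCII code of the character minus 33.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Two made-up reads held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
records = list(read_fastq(example))
for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

For paired-end data the same reader would be run over two files, with the records of the first file matched in order to the records of the second.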

5.3  RNA-SEQ DATA ANALYSIS WORKFLOW

The first steps of RNA-Seq data analysis are the same as in other sequencing applications.
The raw sequencing data (usually in FASTQ files) must pass through the quality control steps
that were discussed in detail in Chapter 1. In general, the steps of the workflow include
quality control, read alignment, read quantification, differential expression, annotation,
and interpretation (Figure 5.1).

5.3.1  Acquiring RNA-Seq Data

The RNA-Seq raw data are sequence reads produced by a sequencing instrument. Raw RNA-Seq
sequence data for some projects are available in public databases and can be